In this study, the patients’ data who are diagnosed with the disease are analyzed. Using speech data from subjects is expected to help the development of a noninvasive diagnostic. People with Parkinsonism suffer from speech impairments like dysphonia (defective use of the voice), hypophonia (reduced volume), monotone (reduced pitch range), and dysarthria (difficulty with articulation of sounds or syllables). Therefore, the analysis in this project will be based on voice parameters of the affected.
The dataset was created by Athanasios Tsanas and Max Little of the University of Oxford, in collaboration with 10 medical centers in the US and Intel Corporation who developed the tele-monitoring device to record the speech signals.
This dataset is composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson’s disease recruited to a six-month trial of a tele-monitoring device for remote symptom progression monitoring. The recordings were automatically captured in the patient’s homes.
Columns in the dataset contain subject number, subject age, subject gender, time interval from baseline recruitment date, motor UPDRS, total UPDRS, and 16 biomedical voice measures. Each row corresponds to one of 5,875 voice recording from these individuals. The main aim of the data is to predict the motor and total UPDRS scores (‘motor_UPDRS’ and ‘total_UPDRS’) from the 16 voice measures. The data is in ASCII CSV format. The rows of the CSV file contain an instance corresponding to one voice recording. There are around 200 recordings per patient, the subject number of the patient is identified in the first column.
Subject: Integer that uniquely identifies each subject Age: Subject age Sex: Subject gender ‘0’ - male, ‘1’ - female Test_time: Time since recruitment into the trial. The integer part is the number of days since recruitment Motor_UPDRS: Clinician’s motor UPDRS score, linearly interpolated Total_UPDRS: Clinician’s total UPDRS score, linearly interpolated Jitter (%), Jitter(Abs), Jitter. RAP, Jitter. PPQ5, Jitter. DDP: Several measures of variation in fundamental frequency (Frequency parameters) Shimmer, Shimmer (dB), Shimmer. APQ3, Shimmer. APQ5, Shimmer. APQ11, Shimmer. DDA: Several measures of variation in amplitude (Amplitude parameters) NHR, HNR: Two measures of ratio of noise to tonal components in the voice RPDE: A nonlinear dynamical complexity measure DFA: Signal fractal scaling exponent PPE: A nonlinear measure of fundamental frequency variation
library(ggplot2)
library(caret)
## Loading required package: lattice
library(naniar)
library(MASS)
library(lattice)
library(e1071)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(corrplot)
## corrplot 0.84 loaded
library(knitr)
library(mvnormtest)
library(MVA)
## Loading required package: HSAUR2
## Loading required package: tools
dataurl <- ('http://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/parkinsons_updrs.data')
download.file(url = dataurl, destfile = "parkinsons_updrs.data")
parkinsons_df <- read.csv("parkinsons_updrs.data",header = TRUE,sep = ',')
str(parkinsons_df)
## 'data.frame': 5875 obs. of 22 variables:
## $ subject. : int 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 72 72 72 72 72 72 72 72 72 72 ...
## $ sex : int 0 0 0 0 0 0 0 0 0 0 ...
## $ test_time : num 5.64 12.67 19.68 25.65 33.64 ...
## $ motor_UPDRS : num 28.2 28.4 28.7 28.9 29.2 ...
## $ total_UPDRS : num 34.4 34.9 35.4 35.8 36.4 ...
## $ Jitter... : num 0.00662 0.003 0.00481 0.00528 0.00335 0.00353 0.00422 0.00476 0.00432 0.00496 ...
## $ Jitter.Abs. : num 3.38e-05 1.68e-05 2.46e-05 2.66e-05 2.01e-05 ...
## $ Jitter.RAP : num 0.00401 0.00132 0.00205 0.00191 0.00093 0.00119 0.00212 0.00226 0.00156 0.00258 ...
## $ Jitter.PPQ5 : num 0.00317 0.0015 0.00208 0.00264 0.0013 0.00159 0.00221 0.00259 0.00207 0.00253 ...
## $ Jitter.DDP : num 0.01204 0.00395 0.00616 0.00573 0.00278 ...
## $ Shimmer : num 0.0256 0.0202 0.0168 0.0231 0.017 ...
## $ Shimmer.dB. : num 0.23 0.179 0.181 0.327 0.176 0.214 0.445 0.212 0.371 0.31 ...
## $ Shimmer.APQ3 : num 0.01438 0.00994 0.00734 0.01106 0.00679 ...
## $ Shimmer.APQ5 : num 0.01309 0.01072 0.00844 0.01265 0.00929 ...
## $ Shimmer.APQ11: num 0.0166 0.0169 0.0146 0.0196 0.0182 ...
## $ Shimmer.DDA : num 0.0431 0.0298 0.022 0.0332 0.0204 ...
## $ NHR : num 0.0143 0.0111 0.0202 0.0278 0.0116 ...
## $ HNR : num 21.6 27.2 23 24.4 26.1 ...
## $ RPDE : num 0.419 0.435 0.462 0.487 0.472 ...
## $ DFA : num 0.548 0.565 0.544 0.578 0.561 ...
## $ PPE : num 0.16 0.108 0.21 0.333 0.194 ...
colnames(parkinsons_df)
## [1] "subject." "age" "sex" "test_time"
## [5] "motor_UPDRS" "total_UPDRS" "Jitter..." "Jitter.Abs."
## [9] "Jitter.RAP" "Jitter.PPQ5" "Jitter.DDP" "Shimmer"
## [13] "Shimmer.dB." "Shimmer.APQ3" "Shimmer.APQ5" "Shimmer.APQ11"
## [17] "Shimmer.DDA" "NHR" "HNR" "RPDE"
## [21] "DFA" "PPE"
summary(parkinsons_df)
## subject. age sex test_time
## Min. : 1.00 Min. :36.0 Min. :0.0000 Min. : -4.263
## 1st Qu.:10.00 1st Qu.:58.0 1st Qu.:0.0000 1st Qu.: 46.847
## Median :22.00 Median :65.0 Median :0.0000 Median : 91.523
## Mean :21.49 Mean :64.8 Mean :0.3178 Mean : 92.864
## 3rd Qu.:33.00 3rd Qu.:72.0 3rd Qu.:1.0000 3rd Qu.:138.445
## Max. :42.00 Max. :85.0 Max. :1.0000 Max. :215.490
## motor_UPDRS total_UPDRS Jitter... Jitter.Abs.
## Min. : 5.038 Min. : 7.00 Min. :0.000830 Min. :2.250e-06
## 1st Qu.:15.000 1st Qu.:21.37 1st Qu.:0.003580 1st Qu.:2.244e-05
## Median :20.871 Median :27.58 Median :0.004900 Median :3.453e-05
## Mean :21.296 Mean :29.02 Mean :0.006154 Mean :4.403e-05
## 3rd Qu.:27.596 3rd Qu.:36.40 3rd Qu.:0.006800 3rd Qu.:5.333e-05
## Max. :39.511 Max. :54.99 Max. :0.099990 Max. :4.456e-04
## Jitter.RAP Jitter.PPQ5 Jitter.DDP Shimmer
## Min. :0.000330 Min. :0.000430 Min. :0.000980 Min. :0.00306
## 1st Qu.:0.001580 1st Qu.:0.001820 1st Qu.:0.004730 1st Qu.:0.01912
## Median :0.002250 Median :0.002490 Median :0.006750 Median :0.02751
## Mean :0.002987 Mean :0.003277 Mean :0.008962 Mean :0.03404
## 3rd Qu.:0.003290 3rd Qu.:0.003460 3rd Qu.:0.009870 3rd Qu.:0.03975
## Max. :0.057540 Max. :0.069560 Max. :0.172630 Max. :0.26863
## Shimmer.dB. Shimmer.APQ3 Shimmer.APQ5 Shimmer.APQ11
## Min. :0.026 Min. :0.00161 Min. :0.00194 Min. :0.00249
## 1st Qu.:0.175 1st Qu.:0.00928 1st Qu.:0.01079 1st Qu.:0.01566
## Median :0.253 Median :0.01370 Median :0.01594 Median :0.02271
## Mean :0.311 Mean :0.01716 Mean :0.02014 Mean :0.02748
## 3rd Qu.:0.365 3rd Qu.:0.02057 3rd Qu.:0.02375 3rd Qu.:0.03272
## Max. :2.107 Max. :0.16267 Max. :0.16702 Max. :0.27546
## Shimmer.DDA NHR HNR RPDE
## Min. :0.00484 Min. :0.000286 Min. : 1.659 Min. :0.1510
## 1st Qu.:0.02783 1st Qu.:0.010955 1st Qu.:19.406 1st Qu.:0.4698
## Median :0.04111 Median :0.018448 Median :21.920 Median :0.5423
## Mean :0.05147 Mean :0.032120 Mean :21.680 Mean :0.5415
## 3rd Qu.:0.06173 3rd Qu.:0.031463 3rd Qu.:24.444 3rd Qu.:0.6140
## Max. :0.48802 Max. :0.748260 Max. :37.875 Max. :0.9661
## DFA PPE
## Min. :0.5140 Min. :0.02198
## 1st Qu.:0.5962 1st Qu.:0.15634
## Median :0.6436 Median :0.20550
## Mean :0.6532 Mean :0.21959
## 3rd Qu.:0.7113 3rd Qu.:0.26449
## Max. :0.8656 Max. :0.73173
parkinsons_df <- parkinsons_df[!duplicated(parkinsons_df),]
dim(parkinsons_df)
## [1] 5875 22
Data cleaning and finding missing values and outliers. This step belongs to data preparation.
vis_miss(parkinsons_df)
sum(is.na(parkinsons_df))
## [1] 0
From the above plot we see that there is no missing values or outliers.
cm <- colMeans(parkinsons_df)
Cov <- cov(parkinsons_df)
d <- apply(parkinsons_df,1,function(parkinsons_df) t(parkinsons_df-cm)%*% solve(Cov) %*% (parkinsons_df-cm))
From the correlation plot we see how the variables relate to each other.
plot(qchisq((1:nrow(parkinsons_df)-1/2)/nrow(parkinsons_df),df=ncol(parkinsons_df)),
sort(d),
xlab = expression(paste(chi[22]^2, "Quantile")), ylab = "Ordered distances")
abline(a = 0, b = 1)
From the above graph, we use the Chi-squared distribution for the Parkinson’s data. The x-axis has the chi-squared qauntile and y-axis has the distance.The data is distributed over the graph.
missing <- apply(parkinsons_df, 2, function(parkinsons_df)
round(100 * (length(which(is.na(parkinsons_df))))/length(parkinsons_df) , digits = 1))
as.data.frame(missing)
## missing
## subject. 0
## age 0
## sex 0
## test_time 0
## motor_UPDRS 0
## total_UPDRS 0
## Jitter... 0
## Jitter.Abs. 0
## Jitter.RAP 0
## Jitter.PPQ5 0
## Jitter.DDP 0
## Shimmer 0
## Shimmer.dB. 0
## Shimmer.APQ3 0
## Shimmer.APQ5 0
## Shimmer.APQ11 0
## Shimmer.DDA 0
## NHR 0
## HNR 0
## RPDE 0
## DFA 0
## PPE 0
corrplot(cor(parkinsons_df), type="full", method ="color", title = "Parkinsons correlatoin plot", mar=c(0,0,1,0), tl.cex= 0.8, outline= T, tl.col="indianred4")
corrplot(cor(parkinsons_df), type="full", method ="color", title = "Parkinsons correlatoin plot", mar=c(0,0,1,0), tl.cex= 0.8, outline= T, tl.col="indianred4")
summary(parkinsons_df[,-3])
## subject. age test_time motor_UPDRS
## Min. : 1.00 Min. :36.0 Min. : -4.263 Min. : 5.038
## 1st Qu.:10.00 1st Qu.:58.0 1st Qu.: 46.847 1st Qu.:15.000
## Median :22.00 Median :65.0 Median : 91.523 Median :20.871
## Mean :21.49 Mean :64.8 Mean : 92.864 Mean :21.296
## 3rd Qu.:33.00 3rd Qu.:72.0 3rd Qu.:138.445 3rd Qu.:27.596
## Max. :42.00 Max. :85.0 Max. :215.490 Max. :39.511
## total_UPDRS Jitter... Jitter.Abs. Jitter.RAP
## Min. : 7.00 Min. :0.000830 Min. :2.250e-06 Min. :0.000330
## 1st Qu.:21.37 1st Qu.:0.003580 1st Qu.:2.244e-05 1st Qu.:0.001580
## Median :27.58 Median :0.004900 Median :3.453e-05 Median :0.002250
## Mean :29.02 Mean :0.006154 Mean :4.403e-05 Mean :0.002987
## 3rd Qu.:36.40 3rd Qu.:0.006800 3rd Qu.:5.333e-05 3rd Qu.:0.003290
## Max. :54.99 Max. :0.099990 Max. :4.456e-04 Max. :0.057540
## Jitter.PPQ5 Jitter.DDP Shimmer Shimmer.dB.
## Min. :0.000430 Min. :0.000980 Min. :0.00306 Min. :0.026
## 1st Qu.:0.001820 1st Qu.:0.004730 1st Qu.:0.01912 1st Qu.:0.175
## Median :0.002490 Median :0.006750 Median :0.02751 Median :0.253
## Mean :0.003277 Mean :0.008962 Mean :0.03404 Mean :0.311
## 3rd Qu.:0.003460 3rd Qu.:0.009870 3rd Qu.:0.03975 3rd Qu.:0.365
## Max. :0.069560 Max. :0.172630 Max. :0.26863 Max. :2.107
## Shimmer.APQ3 Shimmer.APQ5 Shimmer.APQ11 Shimmer.DDA
## Min. :0.00161 Min. :0.00194 Min. :0.00249 Min. :0.00484
## 1st Qu.:0.00928 1st Qu.:0.01079 1st Qu.:0.01566 1st Qu.:0.02783
## Median :0.01370 Median :0.01594 Median :0.02271 Median :0.04111
## Mean :0.01716 Mean :0.02014 Mean :0.02748 Mean :0.05147
## 3rd Qu.:0.02057 3rd Qu.:0.02375 3rd Qu.:0.03272 3rd Qu.:0.06173
## Max. :0.16267 Max. :0.16702 Max. :0.27546 Max. :0.48802
## NHR HNR RPDE DFA
## Min. :0.000286 Min. : 1.659 Min. :0.1510 Min. :0.5140
## 1st Qu.:0.010955 1st Qu.:19.406 1st Qu.:0.4698 1st Qu.:0.5962
## Median :0.018448 Median :21.920 Median :0.5423 Median :0.6436
## Mean :0.032120 Mean :21.680 Mean :0.5415 Mean :0.6532
## 3rd Qu.:0.031463 3rd Qu.:24.444 3rd Qu.:0.6140 3rd Qu.:0.7113
## Max. :0.748260 Max. :37.875 Max. :0.9661 Max. :0.8656
## PPE
## Min. :0.02198
## 1st Qu.:0.15634
## Median :0.20550
## Mean :0.21959
## 3rd Qu.:0.26449
## Max. :0.73173
From the above correlation plot, we can see how the attributes are correlated to one another. We see that HNR is negatively correlated to all the other attributes. We also summarise about all the attributes to statistically know where and how they influence the prediction and how accurate it could be for detection.
Next, we plot scatter plots in order to identify if there are outlier influences.
plot(jitter(total_UPDRS)~.,parkinsons_df)
In the scatter plots between total_UPDRS vs Jitter, we observe some outliers in the data. Similary, we observe the same in the other plots. We use total_UPDRS because that attribute is the dependant variable necessary for the prediction. Next, we recheck and confirm for the outliers using bivariate boxplot.
#Bivariate Box plot for checking for outliers.
bvbox(parkinsons_df[,6:7],xlab = "total_UPDRS", ylab = "Jitter")
bvbox(parkinsons_df[,c(6,12)],xlab = "total_UPDRS", ylab = "Shimmer")
bvbox(parkinsons_df[,c(6,18)],xlab = "total_UPDRS", ylab = "NHR")
bvbox(parkinsons_df[,c(6,20)],xlab = "total_UPDRS", ylab = "RPDE")
bvbox(parkinsons_df[,c(6,21)],xlab = "total_UPDRS", ylab = "DFA")
bvbox(parkinsons_df[,c(6,22)],xlab = "total_UPDRS", ylab = "PPE")
From the bivariate boxplots, we can see that the data has a considerable number of outliers and hence, data cleaning is required.We will have to remove outliers however, we do not wish to change the distribution, hence, we use the convex hull method.
hull1 <- chull(parkinsons_df[,6:7])
parkhull <- match(lab <- rownames(parkinsons_df[hull1,]),rownames(parkinsons_df))
plot(parkinsons_df[,6:7],xlab = "total_UPDRS",ylab = "Jitter")
polygon(parkinsons_df$Jitter...[hull1]~parkinsons_df$total_UPDRS[hull1])
text(parkinsons_df[parkhull,6:7],labels = lab, pch=".", cex = 0.9)
We use this method to remove the outliers present in our data without affecting its distribution.
outlier <- parkinsons_df[-hull1,]
dim(outlier)
## [1] 5858 22
dim(parkinsons_df)
## [1] 5875 22
hull2 <- chull(outlier[,c(6,12)])
parkinsons_df <- outlier[-hull2,]
hull3 <- chull(parkinsons_df[,c(6,18)])
outlier <- parkinsons_df[-hull3,]
hull4 <- chull(outlier[,c(6,20)])
parkinsons_df <- outlier[-hull4,]
hull5 <- chull(parkinsons_df[,c(6,21)])
outlier <- parkinsons_df[-hull5,]
hull6 <- chull(outlier[,c(6,22)])
parkinsons_df <- outlier[-hull6,]
dim(parkinsons_df)
## [1] 5772 22
We reduce the number of variables as it is of high number (22 variables) and some variables have high correlations between them.
Multidimentional scaling is done to visualize relationships in 2D.
parkinsoncorr <- cor(parkinsons_df)
colnames(parkinsoncorr) <- row.names(parkinsoncorr) <- parkinsonlabs <- c(colnames(parkinsons_df))
rnge <- sapply(parkinsons_df, function(parkinsons_df) diff(range(parkinsons_df)))
S_parkinsons <- sweep(parkinsons_df, 2, rnge, FUN = "/")
parkinsondist <- dist(S_parkinsons)
parkinsondist_mds <- cmdscale(parkinsondist, k = 21, eig = TRUE)
parkinsondistpoints <- parkinsondist_mds$points
lam <- parkinsondist_mds$eig
criterion1 <- cumsum(abs(lam)) / sum(abs(lam))
criterion2 <- cumsum(lam^2) / sum(lam^2)
x <- parkinsondist_mds$points[,1]
y <- parkinsondist_mds$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Parkinsons MDS",pch=20,cex=0.1)
text(x, y, labels = parkinsons_df[,3], cex=0.8)
As we know from the data that 0: indicates male and 1: indicates female, we see from the MDS plot that there is a clear significant deviation. This is because of voice pitch, frequency, amplitude. Age also seems to be creating a significant deviation.
parkinsoncorr <- cor(parkinsons_df)
colnames(parkinsoncorr) <- row.names(parkinsoncorr) <- parkinsonlabs <- c(colnames(parkinsons_df))
rnge <- sapply(parkinsons_df, function(parkinsons_df) diff(range(parkinsons_df)))
S_parkinsons <- sweep(parkinsons_df, 2, rnge, FUN = "/")
parkinsondist <- dist(parkinsoncorr)
parkinsondist_mds <- cmdscale(parkinsondist, k = 21, eig = TRUE)
parkinsondistpoints <- parkinsondist_mds$points
lam <- parkinsondist_mds$eig
criterion1 <- cumsum(abs(lam)) / sum(abs(lam))
criterion2 <- cumsum(lam^2) / sum(lam^2)
#criterion 1 and criterion 2 suggests that the first two coordinates can represents majority of the data points since the cummulative proportion is above the threshold value of 0.8
#hence the MDS plot can be on a 2D scatterplot
x <- parkinsondist_mds$points[,1]
y <- parkinsondist_mds$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Parkinsons MDS",pch=20,cex=0.1)
text(x, y, labels = colnames(parkinsoncorr), cex=0.8)
From the multidimensional scaling on the data, we see that the data attributes follow a pattern with each other. We can see that Jitter variables is related to frequency and Shimmer variables are related to amplitude. Motor_UPDRS influences total_UPDRS. Age influences and contributes to the variation in data. Test_time and Sex influences the variation as well.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
#scaling the data
rnge <- sapply(parkinsons_df, function(parkinsons_df) diff(range(parkinsons_df)))
S_parkinsons <- sweep(parkinsons_df, 2, rnge, FUN = "/")
# Create the forest.
output.forest <- randomForest(S_parkinsons$total_UPDRS~age+sex+test_time+Jitter...+Jitter.Abs.+Jitter.RAP+Jitter.PPQ5+Jitter.DDP+Shimmer+Shimmer.dB.+Shimmer.APQ3+Shimmer.APQ5+Shimmer.APQ11+Shimmer.DDA+NHR+HNR+RPDE+DFA+PPE, data = S_parkinsons,mtry = 6)
# View the forest results.
print(output.forest)
##
## Call:
## randomForest(formula = S_parkinsons$total_UPDRS ~ age + sex + test_time + Jitter... + Jitter.Abs. + Jitter.RAP + Jitter.PPQ5 + Jitter.DDP + Shimmer + Shimmer.dB. + Shimmer.APQ3 + Shimmer.APQ5 + Shimmer.APQ11 + Shimmer.DDA + NHR + HNR + RPDE + DFA + PPE, data = S_parkinsons, mtry = 6)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 0.00620812
## % Var explained: 86.91
Random forest has helped in indentifying the top factors that influence the disease progression and that are DFA, Age, JItter.Abs.,Sex, PPE, HNR, RPDE and test_time.
# Importance of each predictor.
impfactors <- importance(output.forest,type = 2)
impfactors <- data.frame(impfactors)
impfactorsranked <- impfactors[order(-impfactors$IncNodePurity),,drop=FALSE]
print(impfactorsranked)
## IncNodePurity
## age 107.219773
## DFA 30.287162
## test_time 13.318873
## Jitter.Abs. 13.101650
## HNR 13.071159
## RPDE 12.777931
## sex 10.614591
## PPE 10.293356
## NHR 7.515697
## Shimmer.APQ11 6.360860
## Shimmer.APQ3 5.650422
## Shimmer.DDA 5.579527
## Shimmer.APQ5 5.507955
## Jitter.PPQ5 5.183703
## Jitter.DDP 4.942593
## Jitter... 4.828989
## Shimmer 4.647830
## Jitter.RAP 4.553628
## Shimmer.dB. 4.508412
Above we see, the importance of each predictor we have taken.
#Exploratory factor analysis
library(MVA)
options(digits = 3)
# EFA
#head(parkinsons) #2:4,8,16,18:22
parkinson.EFA <- factanal(parkinsons_df[, c(2:5,8,16,18:22)], 3, n.obs = nrow(parkinsons_df), rotation="varimax", control=list(trace=T))
## start 1 value: 0.261 uniqs: 0.9705 0.8226 0.9983 0.9623 0.0050 0.2976 0.0050 0.0777 0.5282 0.6720 0.2671
parkinson.EFA
##
## Call:
## factanal(x = parkinsons_df[, c(2:5, 8, 16, 18:22)], factors = 3, n.obs = nrow(parkinsons_df), rotation = "varimax", control = list(trace = T))
##
## Uniquenesses:
## age sex test_time motor_UPDRS Jitter.Abs.
## 0.971 0.823 0.998 0.962 0.005
## Shimmer.APQ11 NHR HNR RPDE DFA
## 0.298 0.005 0.078 0.528 0.672
## PPE
## 0.267
##
## Loadings:
## Factor1 Factor2 Factor3
## age 0.162
## sex -0.416
## test_time
## motor_UPDRS 0.187
## Jitter.Abs. 0.910 0.408
## Shimmer.APQ11 0.651 0.523
## NHR 0.915 0.159 -0.365
## HNR -0.689 -0.657 -0.131
## RPDE 0.478 0.430 0.241
## DFA 0.145 0.161 0.530
## PPE 0.692 0.354 0.359
##
## Factor1 Factor2 Factor3
## SS loadings 3.294 1.134 0.965
## Proportion Var 0.299 0.103 0.088
## Cumulative Var 0.299 0.403 0.490
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 1506 on 25 degrees of freedom.
## The p-value is 7.7e-303
print(parkinson.EFA$loadings, cut = 0.40)
##
## Loadings:
## Factor1 Factor2 Factor3
## age
## sex -0.416
## test_time
## motor_UPDRS
## Jitter.Abs. 0.910 0.408
## Shimmer.APQ11 0.651 0.523
## NHR 0.915
## HNR -0.689 -0.657
## RPDE 0.478 0.430
## DFA 0.530
## PPE 0.692
##
## Factor1 Factor2 Factor3
## SS loadings 3.294 1.134 0.965
## Proportion Var 0.299 0.103 0.088
## Cumulative Var 0.299 0.403 0.490
parkinson.EFA <- factanal(parkinsons_df[, c(2:8,17,18:22)], 2, n.obs = nrow(parkinsons_df), rotation="varimax", control=list(trace=T))
## start 1 value: 2.27 uniqs: 0.906 0.993 0.994 0.102 0.005 0.102 0.140 0.446 0.332 0.353 0.671 0.894 0.332
parkinson.EFA
##
## Call:
## factanal(x = parkinsons_df[, c(2:8, 17, 18:22)], factors = 2, n.obs = nrow(parkinsons_df), rotation = "varimax", control = list(trace = T))
##
## Uniquenesses:
## age sex test_time motor_UPDRS total_UPDRS Jitter...
## 0.906 0.993 0.993 0.102 0.005 0.102
## Jitter.Abs. Shimmer.DDA NHR HNR RPDE DFA
## 0.140 0.446 0.332 0.353 0.671 0.894
## PPE
## 0.332
##
## Loadings:
## Factor1 Factor2
## age 0.303
## sex
## test_time
## motor_UPDRS 0.946
## total_UPDRS 0.996
## Jitter... 0.947
## Jitter.Abs. 0.927
## Shimmer.DDA 0.743
## NHR 0.817
## HNR -0.798 -0.106
## RPDE 0.561 0.120
## DFA 0.286 -0.154
## PPE 0.811
##
## Factor1 Factor2
## SS loadings 4.68 2.052
## Proportion Var 0.36 0.158
## Cumulative Var 0.36 0.518
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 13107 on 53 degrees of freedom.
## The p-value is 0
print(parkinson.EFA$loadings, cut = 0.5)
##
## Loadings:
## Factor1 Factor2
## age
## sex
## test_time
## motor_UPDRS 0.946
## total_UPDRS 0.996
## Jitter... 0.947
## Jitter.Abs. 0.927
## Shimmer.DDA 0.743
## NHR 0.817
## HNR -0.798
## RPDE 0.561
## DFA
## PPE 0.811
##
## Factor1 Factor2
## SS loadings 4.68 2.052
## Proportion Var 0.36 0.158
## Cumulative Var 0.36 0.518
From Exploratory Factor Analysis, we try to identify important factors. Certain attributes contribute higher to the split. We observe that we take 2 or 3 factors. Age, sex and test_time have very small factor coefficient and large uniqueness. 3 factor analysis we can see that jitter, shimmer, NHR, HNR, RPDE and PPE have higher coefficents with Factor 1.
We use principle component Anaysis to score the importance of each attribute variable. This solves the multicollinearity problem.
library(stats)
#outliers have alreaady been removed so PCA does not requiere any changes in the data
#standard deviations of data set
p_sd <- sd(is.numeric(parkinsons_df))
# creating covariance matrix for entire dataset
p_cov <- cov(parkinsons_df, use = "everything")
#creating correlation matrix for the entire dataset
p_corr <-cor(parkinsons_df, use = "everything")
#Principal Components Analysis for correlation matrix
# we have chosen to utilize the correlation matrix for the PCA since the variables have different scales and variances
parkinsons_pca_corr <- princomp(parkinsons_df, cor = T, scores = TRUE)
summary(parkinsons_pca_corr, loadings = T)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 3.376 1.492 1.3134 1.2095 1.0758 1.0009 0.9112 0.8349
## Proportion of Variance 0.518 0.101 0.0784 0.0665 0.0526 0.0455 0.0377 0.0317
## Cumulative Proportion 0.518 0.619 0.6976 0.7641 0.8167 0.8623 0.9000 0.9317
## Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## Standard deviation 0.712 0.5402 0.45626 0.40866 0.31880 0.29550 0.22600
## Proportion of Variance 0.023 0.0133 0.00946 0.00759 0.00462 0.00397 0.00232
## Cumulative Proportion 0.955 0.9680 0.97742 0.98502 0.98964 0.99360 0.99593
## Comp.16 Comp.17 Comp.18 Comp.19 Comp.20 Comp.21
## Standard deviation 0.20713 0.143734 0.112470 0.095625 0.065415 7.02e-04
## Proportion of Variance 0.00195 0.000939 0.000575 0.000416 0.000195 2.24e-08
## Cumulative Proportion 0.99788 0.998815 0.999390 0.999805 1.000000 1.00e+00
## Comp.22
## Standard deviation 1.49e-04
## Proportion of Variance 1.01e-09
## Cumulative Proportion 1.00e+00
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## subject. -0.212 -0.334 0.649 -0.111 0.626
## age -0.313 0.168 -0.302 -0.845 0.217
## sex 0.271 -0.599 0.229 -0.353 -0.590
## test_time -0.974 -0.197
## motor_UPDRS -0.616 -0.131 0.195 0.189 -0.196
## total_UPDRS -0.624 -0.149 0.183 0.145 -0.114
## Jitter... 0.267 -0.225 -0.190 -0.133
## Jitter.Abs. 0.249 -0.335
## Jitter.RAP 0.259 -0.224 -0.226 -0.155
## Jitter.PPQ5 0.265 -0.134 -0.218 -0.145
## Jitter.DDP 0.259 -0.224 -0.226 -0.155
## Shimmer 0.276 0.246
## Shimmer.dB. 0.277 0.234
## Shimmer.APQ3 0.269 0.257 0.110 0.111
## Shimmer.APQ5 0.272 0.261
## Shimmer.APQ11 0.258 0.237 0.166 0.103
## Shimmer.DDA 0.269 0.257 0.110 0.111
## NHR 0.257 -0.224 -0.131 -0.154
## HNR -0.257 -0.144 -0.181 0.158 0.188
## RPDE 0.168 -0.166 0.271 0.205 0.148 -0.715 -0.233
## DFA 0.186 -0.345 0.273 0.471 -0.116 -0.217 0.523 -0.173
## PPE 0.229 -0.279 0.125 0.110 -0.135 -0.147
## Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16 Comp.17
## subject.
## age
## sex -0.101
## test_time
## motor_UPDRS 0.112 0.541 -0.411
## total_UPDRS -0.161 -0.561 0.404
## Jitter... 0.171 0.136
## Jitter.Abs. 0.227 0.472 -0.573 -0.151 -0.397
## Jitter.RAP -0.123 0.224 0.158 0.183 0.311
## Jitter.PPQ5 -0.304 -0.210 0.395 -0.416 -0.426
## Jitter.DDP -0.123 0.224 0.158 0.183 0.311
## Shimmer 0.243
## Shimmer.dB. 0.116 0.771
## Shimmer.APQ3 0.316 -0.254 -0.100 -0.186
## Shimmer.APQ5 -0.119 -0.228 -0.149 -0.430
## Shimmer.APQ11 -0.241 0.315 0.625 0.283 0.223 -0.165
## Shimmer.DDA 0.316 -0.254 -0.100 -0.186
## NHR -0.642 -0.169 -0.496 0.243 0.196 -0.120
## HNR -0.168 0.848 -0.216 -0.119
## RPDE -0.393 0.236 -0.126
## DFA -0.331 -0.222 -0.109
## PPE 0.792 0.235 -0.312
## Comp.18 Comp.19 Comp.20 Comp.21 Comp.22
## subject.
## age
## sex
## test_time
## motor_UPDRS
## total_UPDRS
## Jitter... -0.136 0.849 -0.143
## Jitter.Abs. -0.104
## Jitter.RAP 0.156 -0.206 -0.707
## Jitter.PPQ5 -0.224 -0.342
## Jitter.DDP 0.156 -0.206 0.707
## Shimmer -0.161 -0.858
## Shimmer.dB. 0.216 0.433
## Shimmer.APQ3 -0.311 0.126 -0.707
## Shimmer.APQ5 0.715 0.175 0.113
## Shimmer.APQ11 -0.329
## Shimmer.DDA -0.311 0.126 0.707
## NHR
## HNR
## RPDE
## DFA
## PPE
From Principle component Analysis shows that attributes 1 thorugh 4 are important. Variables containing the word “Shimmer” and Variables containing the word “Jitter” have high positive correlation values with each other. “total_UPDRS” and “motor_UPDRS” showcase a strong positive correlation. PPE variable has strong positive correlations with varaibles containing the word “jitters” or “Shimmers” and the variable RPDE. Variable HNR has an inverse (negative in the graphs) releationship with all the variables. The data has high correlation between the variable groups “Jitter” and “Shimmer” which caused problems related to multi-collinearity during the analysis.
[1] J. Jankovic, “Parkinson’s disease: Clinical features and diagnosis,” J. Neurol. Neurosurgery Psychiatry, vol. 79, no. 4, pp. 368–376, 2007. [2] S. B. O’Sullivan and T. J. Schmitz, “Parkinson disease,” in Physical Rehabilitation, 5th ed. Philadelphia, PA, USA: F. A. Davis Company, 2007, pp. 856–894. [3] Parkinson Derne˘gi. (2011). [Online]. Available: http://www. parkinsondernegi.org/Icerik.aspx?Page=parkinsonnedir&ID=5 [4] L. M. de Lau and M. M. Breteler, “Epidemiology of Parkinson’s disease,” Lancet Neurol., vol. 5, no. 6, pp. 525–535, 2006. [5] N. Singh, V. Pillay, and Y. E. Choonara, “Advances in the treatment of Parkinson’s disease,” Prog. Neurobiol., vol. 81, no. 1, pp. 29–44, 2007. [6] M. A. Little, P. E. McSharry, E. J. Hunter, J. Spielman, and L. O. Ramig, “Suitability of dysphonia measurements for telemonitoring of Parkinson’s disease,” IEEE Trans. Biomed. Eng., vol. 56, no. 4, pp. 1010–1022, Apr. 2009. [7] National Collaborating Centre for Chronic Conditions, Parkinson’s Disease, London, U.K.: Royal College of Physicians, 2006. [8] Betul Erdogdu, SakarMuhammed, Erdem Isenkul, Muhammed Erdem, IsenkulC. Okan, SakarC. and Okan Sakar, “ Collection and Analysis of a Parkinson Speech Dataset With Multiple Types of Sound Recordings”, July 2013, IEEE Journal of Biomedical and Health Informatics 17(4):828-834, DOI: 10.1109/JBHI.2013.2245674 [9] Athanasios Tsanas and Max Little, ‘Accurate telemonitoring of Parkinson’s disease symptom severity using nonlinear speech signal processing and statistical machine learning’ [10] Parkinsons Telemonitoring Data Set , Online link: https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/parkinsons_updrs.names [11] Athanasios Tsanas, Max A. Little, Patrick E. McSharry, Lorraine O. Ramig (2009), ‘Accurate telemonitoring of Parkinson.s disease progression by non-invasive speech tests’, IEEE Transactions on Biomedical Engineering. [12] Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2009), ‘Suitability of dysphonia measurements for telemonitoring of Parkinson’s disease’, IEEE Transactions on Biomedical Engineering, 56(4):1015-1022